AITopics | sparse dataset

Collaborating Authors

sparse dataset

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Scaling Up Differentially Private LASSO Regularized Logistic Regression via Faster Frank-Wolfe Iterations

Neural Information Processing SystemsFeb-14-2026, 07:31:07 GMT

Note that lower values of ϵ and δ correspond to stronger privacy.

algorithm, artificial intelligence, machine learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Santa Clara (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)

Genre:

Research Report > New Finding (0.65)
Research Report > Experimental Study (0.41)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Add feedback

Data Fusion of Deep Learned Molecular Embeddings for Property Prediction

Appleton, Robert J, Barnes, Brian C, Strachan, Alejandro

arXiv.org Artificial IntelligenceOct-29-2025

Data - driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and applicability . To improve predictions, techniques such as transfer learning and multi - task learning have been used. T he performance of multi - task learning models depend s on the strength of the underlying correlations between tasks and the completeness of the dataset . S tandard multi - task models tend to underperform when trained on sparse datasets with weakly correlated properties. To address this gap, we fuse deep - learned embeddings generated by independent pre - trained single - task models, resulting in a multi - task model that inherit s rich, property - specific representations. By re - using (rather than re - training) these embeddings, the resulting fused model outperforms standard multi - task models and can be extended with fewer trainable parameters . We demonstrate this technique on a widely used benchmark dataset of quantum chemistry data for small molecules as well as a newly compiled sparse dataset of experimental data collected from literature and our own quant um chemistry and thermochemical calculations.

artificial intelligence, distribution statement, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1021/acs.jcim.5c01728

2504.07297

Country: North America > United States > Indiana (0.28)

Genre: Research Report > New Finding (0.67)

Industry:

Government > Military (0.68)
Government > Regional Government (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Scaling Up Differentially Private LASSO Regularized Logistic Regression via Faster Frank-Wolfe Iterations

Neural Information Processing SystemsOct-8-2025, 21:53:11 GMT

Note that lower values of ϵ and δ correspond to stronger privacy.

algorithm, artificial intelligence, machine learning, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Santa Clara (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)

Genre:

Research Report > New Finding (0.65)
Research Report > Experimental Study (0.41)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Add feedback

A Comparative Study of Spline-Based Trajectory Reconstruction Methods Across Varying Automatic Vehicle Location Data Densities

Robbennolt, Jake, Munira, Sirajum, Boyles, Stephen D.

arXiv.org Artificial IntelligenceSep-3-2025

Automatic vehicle location (AVL) data offers insights into transit dynamics, but its effectiveness is often hampered by inconsistent update frequencies, necessitating trajectory reconstruction. This research evaluates 13 trajectory reconstruction methods, including several novel approaches, using high-resolution AVL data from Austin, Texas. We examine the interplay of four critical factors -- velocity, position, smoothing, and data density -- on reconstruction performance. A key contribution of this study is evaluation of these methods across sparse and dense datasets, providing insights into the trade-off between accuracy and resource allocation. Our evaluation framework combines traditional mathematical error metrics for positional and velocity with practical considerations, such as physical realism (e.g., aligning velocity and acceleration with stopped states, deceleration rates, and speed variability). In addition, we provide insight into the relative value of each method in calculating realistic metrics for infrastructure evaluations. Our findings indicate that velocity-aware methods consistently outperform position-only approaches. Interestingly, we discovered that smoothing-based methods can degrade overall performance in complex, congested urban environments, although enforcing monotonicity remains critical. The velocity constrained Hermite interpolation with monotonicity enforcement (VCHIP-ME) yields optimal results, offering a balance between high accuracy and computational efficiency. Its minimal overhead makes it suitable for both historical analysis and real-time applications, providing significant predictive power when combined with dense datasets. These findings offer practical guidance for researchers and practitioners implementing trajectory reconstruction systems and emphasize the importance of investing in higher-frequency AVL data collection for improved analysis.

artificial intelligence, machine learning, trajectory, (14 more...)

arXiv.org Artificial Intelligence

2509.00119

Country: North America > United States > Texas > Travis County > Austin (0.48)

Genre: Research Report > New Finding (0.87)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground (0.93)
Transportation > Passenger (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

Add feedback

Design of an basis-projected layer for sparse datasets in deep learning training using gc-ms spectra as a case study

Chang, Yu Tang, Chen, Shih Fang

arXiv.org Artificial IntelligenceMar-14-2024

Deep learning (DL) models encompass millions or even billions of parameters and learn complex patterns from big data. However, not all data are initially stored in a suitable formation to effectively train a DL model, e.g., gas chromatography-mass spectrometry (GC-MS) spectra and DNA sequence. These datasets commonly contain many zero values, and the sparse data formation causes difficulties in optimizing DL models. A DL module called the basis-projected layer (BPL) was proposed to mitigate the issue by transforming the sparse data into a dense representation. The transformed data is expected to facilitate the gradient calculation and finetuned process in a DL training process. The dataset, example of a sparse dataset, contained 362 specialty coffee odorant spectra detected from GC-MS. The BPL layer was placed at the beginning of the DL model. The tunable parameters in the layer were learnable projected axes that were the bases of a new representation space. The layer rotated these bases when its parameters were updated. When the number of the bases was the same as the original dimension, the increasing percentage of the F1 scores was 8.56%. Furthermore, when the number was set as 768 (the original dimension was 490), the increasing percentage of the F1 score was 11.49%. The layer not only maintained the model performance and even constructed a better representation space in analyzing sparse datasets.

dataset, pattern-sparse data, spectra, (15 more...)

arXiv.org Artificial Intelligence

2403.09188

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

Add feedback

Scaling Up Differentially Private LASSO Regularized Logistic Regression via Faster Frank-Wolfe Iterations

Raff, Edward, Khanna, Amol, Lu, Fred

arXiv.org Machine LearningOct-30-2023

To the best of our knowledge, there are no methods today for training differentially private regression models on sparse input data. To remedy this, we adapt the Frank-Wolfe algorithm for $L_1$ penalized linear regression to be aware of sparse inputs and to use them effectively. In doing so, we reduce the training time of the algorithm from $\mathcal{O}( T D S + T N S)$ to $\mathcal{O}(N S + T \sqrt{D} \log{D} + T S^2)$, where $T$ is the number of iterations and a sparsity rate $S$ of a dataset with $N$ rows and $D$ features. Our results demonstrate that this procedure can reduce runtime by a factor of up to $2,200\times$, depending on the value of the privacy parameter $\epsilon$ and the sparsity of the dataset.

algorithm, dataset, iteration, (14 more...)

arXiv.org Machine Learning

2310.19978

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Add feedback

Dealing with Sparse Datasets in Machine Learning

#artificialintelligenceOct-26-2022, 05:17:05 GMT

This article was published as a part of the Data Science Blogathon. Missing data in machine learning is a type of data that contains null values, whereas Sparse data is a type of data that does not contain the actual values of features; it is a dataset containing a high amount of zero or null values. It is a different thing than missing data. Sparse datasets with high zero values can cause problems like over-fitting in the machine learning models and several other problems. That is why dealing with sparse data is one of the most hectic processes in machine learning.

algorithm, dataset, sparse dataset, (11 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.52)

Add feedback

Yet Another Library for Deep Learning You Should Know About

#artificialintelligenceDec-15-2020, 15:17:38 GMT

It has many algorithms, supports sparse datasets, is fast and has many utility functions, like cross-validation, grid search, etc. When it comes to advanced modeling, scikit-learn many times falls shorts. If you need Boosting, Neural Networks or t-SNE, it's better to avoid scikit-learn. While MLPClassifier and MLPRegressor have a rich set of arguments, there's no option to customize layers of a Neural Network (beyond setting the number of hidden units for each layer) and there's no GPU support. While there are already superior libraries available like PyTorch or Tensorflow, scikit-neuralnetwork may be a good choice for those coming from a scikit-learn ecosystem.

deep learning, neural net, pipeline, (13 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Deep Learning with scikit-learn

#artificialintelligenceAug-21-2020, 11:35:40 GMT

It has a good set of algorithms, supports sparse datasets, it is fast and has many utility functions, like cross-validation, grid search, etc. When it comes to advanced modeling, scikit-learn many times falls shorts. If you need Boosting, Neural Networks or t-SNE, it is better to avoid scikit-learn. There is MLPClassifier for classification and MLPRegressor for regression. While both have a rich set of arguments, there isn't an option to customize layers of a Neural Network (beyond setting the number of hidden units for each layer).

artificial intelligence, machine learning, neural net, (16 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Chromatic Learning for Sparse Datasets

Feinberg, Vladimir, Bailis, Peter

arXiv.org Machine LearningJun-6-2020

Learning over sparse, high-dimensional data frequently necessitates the use of specialized methods such as the hashing trick. In this work, we design a highly scalable alternative approach that leverages the low degree of feature co-occurrences present in many practical settings. This approach, which we call Chromatic Learning (CL), obtains a low-dimensional dense feature representation by performing graph coloring over the co-occurrence graph of features---an approach previously used as a runtime performance optimization for GBDT training. This color-based dense representation can be combined with additional dense categorical encoding approaches, e.g., submodular feature compression, to further reduce dimensionality. CL exhibits linear parallelizability and consumes memory linear in the size of the co-occurrence graph. By leveraging the structural properties of the co-occurrence graph, CL can compress sparse datasets, such as KDD Cup 2012, that contain over 50M features down to 1024, using an order of magnitude fewer features than frequency-based truncation and the hashing trick while maintaining the same test error for linear models. This compression further enables the use of deep networks in this wide, sparse setting, where CL similarly has favorable performance compared to existing baselines for budgeted input dimension.

artificial intelligence, data mining, machine learning, (17 more...)

arXiv.org Machine Learning

2006.03779

Genre: Research Report (0.82)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback